Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable Importance

نویسندگان

  • Carolin Strobl
  • Achim Zeileis
چکیده

Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures. For the test of Breiman and Cutler (2008), we investigate the statistical properties and find that the power of the test depends both on the sample size and the number of trees in an undesirable way that nullifies any significance judgments. Moreover, the specification of the null hypothesis of this test is discussed in the context of correlated predictor variables.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Zeileis Danger : High Power ! – Exploring the Statistical Properties of a Test for Random Forest Variable

Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures, especially in the context of correlated pred...

متن کامل

Identification of Statistically Significant Features from Random Forests

Embedded feature selection can be performed by analyzing the variables used in a Random Forest. Such a multivariate selection takes into account the interactions between variables but is not easy to interpret in a statistical sense. We propose a statistical procedure to measure variable importance that tests if variables are significantly useful in combination with others in a forest. We show e...

متن کامل

Inferring statistically significant features from random forests

Embedded feature selection can be performed by analyzing the variables used in a Random Forest. Such a multivariate selection takes into account the interactions between variables but is not straightforward to interpret in a statistical sense. We propose a statistical procedure to measure variable importance that tests if variables are significantly useful in combination with others in a forest...

متن کامل

Comparing Different Modeling Techniques for Predicting Presence-absence of Some Dominant Plant Species in Mountain Rangelands, Mazandaran Province

In applied studies, the investigation of the relationship between a plant species and environmental variables is essential to manage ecological problems and rangeland ecosystems. This research was conducted in summer 2016. The aim of this study was to compare the predictive power of a number of Species Distribution Models (SDMs) and to evaluate the importance of a range of environmental variabl...

متن کامل

ggRandomForests: Exploring Random Forest Survival

Random forest (Breiman 2001a) (RF) is a non-parametric statistical method requiring no distributional assumptions on covariate relation to the response. RF is a robust, nonlinear technique that optimizes predictive accuracy by fitting an ensemble of trees to stabilize model estimates. Random survival forests (RSF) (Ishwaran and Kogalur 2007; Ishwaran, Kogalur, Blackstone, and Lauer 2008) are an...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008